A standard notion about science is that it works by falsification (a.k.a. Popperianism)
Scientific hypotheses are not models
A scientific hypothesis can be composed of many proposed processes
A scientific process may or may not be directly modelable statistically, but for the most part statistical models are not scientific process models.
So, one scientific hypothesis can relate to many process models, which can be approximated using statistical models. Later on, there is a discussion of the Ptolemaic geocentric view of the solar system, which is akin to a statistical model that has much predictive power but absolutely no explanatory power. This is because, while you can create arbitrarily complex (and therefore accurate) epicycle-based models (circles on circles), these models provide no insight into the process that generates the phenomenon.
The statistical model can quickly become the “geocentric” model if care isn’t taken to think scientifically.
Science is consensual, not logical. The Ptolemaic view was ultimately rejected by consensus, not logic, since it made precise, accurate predictions without accurately explaining how orbits work.
Part of this book builds up the methods and critical-thinking skills needed to ensure that we think scientifically and not purely statistically, because that is when we are most likely to fall prey to the allure of statistical models and all the baggage they carry with them.
Bayesian inference is no more than counting the number of ways things can happen according to a set of assumptions - the larger the count, the higher the plausibility.
This directly relates to the idea of likelihood: given data, what model provides the best explanation for the data as defined by how likely the data is to occur given the model itself.
Frequentism is a subset of Bayesian inference that states that probabilities are defined as the connection of countable events and their frequencies in very large samples:
The idea of a “population” distribution that is constantly being “sampled” by a “sampling distribution.”
This “population distribution” does not have uncertainty, only the measurements can (the sample)
As the name implies, these are models contained within models - in other words, a model that is parameterized by other models. This leads to varying levels of uncertainty depending on the level of interest.
Why use them?
To adjust estimates for repeat sampling - using the same entity for multiple observations
To adjust estimates for imbalance in sampling - using some entities more often than others
To study variation
To avoid averaging - example: averaging a variable in a regression analysis —> lose some variation
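The last bullet can be made concrete with a quick base-R sketch (a toy example of my own, not from the book): averaging within entities before modeling throws away within-entity variation.

```r
# hypothetical data: 5 entities, 10 observations each
set.seed(3)
d <- data.frame(id = rep(1:5, each = 10), y = rnorm(50))

var(d$y)                      # variation across all raw observations
var(tapply(d$y, d$id, mean))  # far smaller once we average within entity
```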
Information Criteria, such as the AIC (Akaike Information Criterion), allow us to compare models based on future predictive accuracy.
Overfitting: “Fitting is easy; prediction is hard. Future data will not be exactly like past data, and so any model that is unaware of this fact tends to make worse predictions than it could.” The point being that prediction accuracy is different than fit.
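As a toy illustration of this point (my own example, not from the book), AIC penalizes in-sample fit by parameter count, so a wildly flexible model scores worse even though it fits the sample more closely:

```r
# simulate a simple linear relationship (hypothetical data)
set.seed(1)
x <- runif(100)
y <- 2 * x + rnorm(100, sd = 0.5)

m_simple  <- lm(y ~ x)            # the true functional form
m_complex <- lm(y ~ poly(x, 15))  # chases noise in the sample

AIC(m_simple, m_complex)  # lower is better; the simple model typically wins
```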
“Small world” refers to the almost game-ified world where logic prevails
“Large world” refers to the real world where a model is deployed. This world is complex and significantly bigger than the small world, so there can be missing information in the “small world”
tidyverse code:
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5 ✓ purrr 0.3.4
## ✓ tibble 3.1.6 ✓ dplyr 1.0.8
## ✓ tidyr 1.1.3 ✓ stringr 1.4.0
## ✓ readr 1.4.0 ✓ forcats 0.5.1
## Warning: package 'dplyr' was built under R version 4.0.5
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
d <- tibble(p1 = 0,
p2 = rep(1:0, times = c(1, 3)),
p3 = rep(1:0, times = c(2, 2)),
p4 = rep(1:0, times = c(3, 1)),
p5 = 1) %>%
set_names(1:5) %>%
mutate(x = 1:4) %>%
pivot_longer(-x, names_to = "possibility") %>%
mutate(value = value %>% as.character()) %>%
ggplot(aes(x = x, y = possibility, fill = value)) +
geom_point(shape = 21, size = 5) +
scale_fill_manual(values = c("white", "navy")) +
scale_x_discrete(NULL, breaks = NULL) +
theme(legend.position = "none")
d
d <- tibble(position = c((1:4^1) / 4^0, (1:4^2) / 4^1, (1:4^3) / 4^2),
            draw = rep(1:3, times = c(4^1, 4^2, 4^3)),
            fill = rep(c("b", "w"), times = c(1, 3)) %>%
              rep(., times = c(4^0 + 4^1 + 4^2))
            ) %>%
  mutate(denominator = ifelse(draw == 1, .5,
                              ifelse(draw == 2, .5 / 4,
                                     .5 / 4^2))) %>%
  mutate(position = position - denominator)
lines_1 <- tibble(x = rep((1:4), each = 4),
                  xend = ((1:4^2) / 4),
                  y = 1, yend = 2) %>%
  mutate(x = x - 0.5,
         xend = xend - 0.5 / 4^1)
lines_2 <- tibble(x = rep(((1:4^2) / 4), each = 4),
                  xend = (1:4^3) / (4^2),
                  y = 2, yend = 3) %>%
  mutate(x = x - 0.5 / 4^1,
         xend = xend - 0.5 / 4^2)
plot <- d %>%
ggplot(aes(x = position, y = draw)) +
geom_segment(data = lines_1,
aes(x = x, xend = xend,
y = y, yend = yend),
size = 1/3) +
geom_segment(data = lines_2,
aes(x = x, xend = xend,
y = y, yend = yend),
size = 1/3) +
geom_point(aes(fill = fill),
shape = 21, size = 3) +
scale_fill_manual(values = c("navy", "white")) +
scale_y_continuous(breaks = 1:3) +
# coord_polar() +
theme(legend.position = "none",
panel.grid.minor = element_blank())
plot
As you can see, there are 3 possible ways for us to draw bwb if we assume the bag has only 1 blue marble.
If you do this for bags with 2 and 3 blue marbles, you get 8 and 9 possible ways, respectively:
Bayesian Counting of All Possibilities
Now, if we draw a fourth marble after replacement and this marble is blue, what do we do?
We now go through Bayesian Updating of the prior.
Bayesian Updating - multiply the previous “Ways to produce” count by the number of ways to produce another blue marble for each possible bag composition. In this case, the prior is the previous “Ways to produce” count.
This procedure of updating allows us to (a) acquire new observations that we fold into our counting while (b) still accounting for the prior observations.
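A minimal sketch of this counting in R (the bag compositions mirror the marble example; the variable names are my own):

```r
# ways each conjecture (0-4 blue marbles out of 4) can produce the draw b, w, b
blue <- 0:4
ways_bwb <- blue * (4 - blue) * blue  # 0  3  8  9  0

# updating: a fourth draw is blue, so multiply the prior ways
# by the number of ways each bag can produce one more blue
ways_updated <- ways_bwb * blue       # 0  3 16 27  0

# normalizing the counts gives plausibilities
ways_updated / sum(ways_updated)
```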
Let’s show the visual representation of all 3 possible bag compositions (<b,w,w,w>, <b,b,w,w>, <b,b,b,w>). The code is complex and not necessary to understand the story that is being told, but feel free to interrogate it on your own!
d <-tibble(position = c((1:4^1) / 4^0,
(1:4^2) / 4^1,
(1:4^3) / 4^2),
draw = rep(1:3, times = c(4^1, 4^2, 4^3)))
d <- d %>%
bind_rows(
d, d
) %>%
# here are the fill colors
mutate(fill = c(rep(c("w", "b"), times = c(1, 3)) %>% rep(., times = c(4^0 + 4^1 + 4^2)),
rep(c("w", "b"), each = 2) %>% rep(., times = c(4^0 + 4^1 + 4^2)),
rep(c("w", "b"), times = c(3, 1)) %>% rep(., times = c(4^0 + 4^1 + 4^2)))) %>%
# now we need to shift the positions over in accordance with draw, like before
mutate(denominator = ifelse(draw == 1, .5,
ifelse(draw == 2, .5 / 4,
.5 / 4^2))) %>%
mutate(position = position - denominator) %>%
# here we'll add an index for which pie wedge we're working with
mutate(pie_index = rep(letters[1:3], each = n()/3)) %>%
# to get the position axis correct for pie_index == "b" or "c", we'll need to offset
mutate(position = ifelse(pie_index == "a", position,
ifelse(pie_index == "b", position + 4,
position + 4 * 2)))
move_over <- function(position, index) {
ifelse(
index == "a", position,
ifelse(
index == "b", position + 4, position + 4 * 2
)
)
}
lines_1 <-
tibble(x = rep((1:4), each = 4) %>% rep(., times = 3),
xend = ((1:4^2) / 4) %>% rep(., times = 3),
y = 1,
yend = 2) %>%
mutate(x = x - .5,
xend = xend - .5 / 4^1) %>%
# here we'll add an index for which pie wedge we're working with
mutate(pie_index = rep(letters[1:3], each = n()/3)) %>%
# to get the position axis correct for `pie_index == "b"` or `"c"`, we'll need to offset
mutate(x = move_over(position = x, index = pie_index),
xend = move_over(position = xend, index = pie_index))
lines_2 <-
tibble(x = rep(((1:4^2) / 4), each = 4) %>% rep(., times = 3),
xend = (1:4^3 / 4^2) %>% rep(., times = 3),
y = 2,
yend = 3) %>%
mutate(x = x - .5 / 4^1,
xend = xend - .5 / 4^2) %>%
# here we'll add an index for which pie wedge we're working with
mutate(pie_index = rep(letters[1:3], each = n()/3)) %>%
# to get the position axis correct for `pie_index == "b"` or `"c"`, we'll need to offset
mutate(x = move_over(position = x, index = pie_index),
xend = move_over(position = xend, index = pie_index))
d <- d %>%
mutate(remain = c(#pie_index == "a"
rep(0:1, times = c(1, 3)),
rep(0, times = 4),
rep(1:0, times = c(1, 3)) %>% rep(., times = 3),
rep(0, times = 4 * 4),
rep(c(0, 1, 0), times = c(1, 3, 4 * 3)) %>% rep(., times = 3),
# pie_index == "b"
rep(0:1, each = 2),
rep(0, times = 4 * 2),
rep(1:0, each = 2) %>% rep(., times = 2),
rep(0, times = 4 * 4 * 2),
rep(c(0, 1, 0, 1, 0), times = c(2, 2, 2, 2, 8)) %>% rep(., times = 2),
# pie_index == "c",
rep(0:1, times = c(3, 1)),
rep(0, times = 4 * 3),
rep(1:0, times = c(3, 1)),
rep(0, times = 4 * 4 * 3),
rep(0:1, times = c(3, 1)) %>% rep(., times = 3),
rep(0, times = 4)
)
)
lines_1 <- lines_1 %>%
mutate(remain = c(rep(0, times = 4),
rep(1:0, times = c(1, 3)) %>% rep(., times = 3),
rep(0, times = 4 * 2),
rep(1:0, each = 2) %>% rep(., times = 2),
rep(0, times = 4 * 3),
rep(1:0, times = c(3, 1))
)
)
lines_2 <- lines_2 %>%
mutate(remain = c(rep(0, times = 4 * 4),
rep(c(0, 1, 0), times = c(1, 3, 4 * 3)) %>% rep(., times = 3),
rep(0, times = 4 * 8),
rep(c(0, 1, 0, 1, 0), times = c(2, 2, 2, 2, 8)) %>% rep(., times = 2),
rep(0, times = 4 * 4 * 3),
rep(0:1, times = c(3, 1)) %>% rep(., times = 3),
rep(0, times = 4)
)
)
d %>%
ggplot(aes(x = position, y = draw)) +
geom_vline(xintercept = c(0, 4, 8), color = "white", size = 2/3) +
geom_segment(data = lines_1,
aes(x = x, xend = xend,
y = y, yend = yend,
alpha = remain %>% as.character()),
size = 1/3) +
geom_segment(data = lines_2,
aes(x = x, xend = xend,
y = y, yend = yend,
alpha = remain %>% as.character()),
size = 1/3) +
geom_point(aes(fill = fill, size = draw, alpha = remain %>% as.character()),
shape = 21) +
scale_fill_manual(values = c("navy", "white")) +
scale_size_continuous(range = c(3, 1.5)) +
scale_alpha_manual(values = c(0.2, 1)) +
scale_x_continuous(NULL, limits = c(0, 12), breaks = NULL) +
scale_y_continuous(NULL, limits = c(0.75, 3.5), breaks = NULL) +
coord_polar() +
theme(legend.position = "none",
panel.grid = element_blank())
If we go back to the first part of this thought exercise where we draw bwb marbles, recognize that we used the least informative prior: we assumed all possible bag compositions were equally likely. If we had had further information, we could have constructed a more precise prior, but we didn’t. This is called the principle of indifference.
Helpful definitions that are already developed in this framework:
Parameter - The conjecture or, in other words, the possible explanations of the data (i.e., the possible bag contents)
Likelihood - The relative number of ways that a given parameter can produce the observed data (i.e., the counting exercise)
Prior Probability - The previous plausibility of any specific parameter value
Posterior Probability - The updated plausibility of any specific parameter value given new data
Binomial Coefficient (i.e., \(n\) objects choose \(k\) objects): \(\binom{n}{k} = \frac{n!}{k!(n-k)!}\)
In the marble example above, say we have 4 marbles and we select 3 of them. Therefore:
\(\binom{4}{3} = \frac{4!}{3!(4-3)!} = \frac{4\times3\times2\times1}{3\times2\times1\times1} = 4\)
So, this shows us that with 4 marbles in the bag, there are 4 ways to select 3 marbles:
1,2,3
1,3,4
2,3,4
1,2,4
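R’s built-in choose() confirms the count (a quick check, nothing book-specific):

```r
choose(4, 3)                                   # 4
factorial(4) / (factorial(3) * factorial(1))   # same thing via the formula
```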
Binomial Distribution:
Say we wanted to compute the probability of selecting 2 blues in 3 tosses if there were a 50% chance of pulling a blue (2 blues, 2 whites)
\(Pr(w|n,p) = \frac{n!}{w!(n-w)!}\times p^{w}(1-p)^{(n-w)}\) so therefore \(Pr(2|3,.5) = \frac{3!}{2!(3-2)!}\times .5^{2}(1-.5)^{(3-2)} = \frac{3\times2\times1}{2\times1\times1}\times 0.5^2(.5^1) = 3 \times .5^3 = .375\)
The interpretation: there are 3 ways to get 2 blues in 3 tosses, and each way has a \(0.5^3 = 12.5\%\) chance of happening, for a total of 37.5%
This can be calculated using a simple R function dbinom
dbinom(x=2, size=3, prob=.5)
## [1] 0.375
A purely mathematical notion of posteriors, likelihoods, and priors can be written as follows:
\(Pr(w,p) = Pr(w|p)Pr(p)\)
To make this concrete, let’s say \(w\) represents a rainy day and \(p\) represents a cold day. So, the probability of experiencing a rainy day AND a cold day (the “Joint” probability) is equivalent to the probability of experiencing a rainy day when it is cold (the likelihood) multiplied by the probability of a cold day (the prior). This is equivalent to the following:
\(Pr(w,p) = Pr(p|w)Pr(w)\)
Which states that the probability of experiencing a rainy day AND a cold day (the “Joint” probability) is equivalent to the probability of experiencing a cold day when it is rainy (the likelihood) multiplied by the probability of a rainy day (the prior). Note how these two conceptions are identical. So:
\(Pr(w,p) = Pr(w|p)Pr(p) = Pr(p|w)Pr(w)\)
and this equation is equivalent to:
\(Pr(p|w) = \frac{Pr(w|p)Pr(p)}{Pr(w)}\)
which is Bayes Theorem. Don’t worry too much about the denominator, all it does is standardize the posterior (the \(Pr(p|w)\)) such that all posteriors sum to 1.
THE KEY IS THAT THE POSTERIOR IS PROPORTIONAL TO THE PRODUCT OF THE PRIOR AND THE LIKELIHOOD.
A statistical model, in its simplest form, is a little machine that does exactly what it is programmed to do. A simple form of a model for Bayesian updating is a little engine that takes a prior distribution, multiplies it by a likelihood, and returns the posterior distribution.
The three methods we can use to create this Bayesian updating model include:
Grid approximation
Quadratic approximation
Markov chain Monte Carlo (MCMC)
Grid Approximation uses a finite grid of values to approximate the parameter distributions
Grid Approximation is helpful for conceptual understanding but it does not scale
We are showing the posterior distribution of the parameter, \(p\), which represents say the proportion of water on a globe. In this thought exercise, we just randomly spun a globe and dropped a pin on it, and 5 of the 7 spins ended up with the pin in the ocean. So, the following posterior distribution shows the plausibility of values for \(p\) given this information.
GRID_LEN = 100
# define grid of parameter values
# This is 100 discrete numbers between 0 and 1
p_grid <- seq(from=0, to=1, length.out=GRID_LEN)
# define the prior
prior <- rep(1, GRID_LEN)
# compute likelihood at each value in grid
likelihood <- dbinom(x=5, size=7, prob=p_grid)
# compute product of likelihood and prior
unstd.posterior <- likelihood * prior
# standardize the posterior, so it sums to 1
posterior <- unstd.posterior / sum(unstd.posterior)
# Plot the distributions
plot(p_grid, posterior, type="b",
xlab = "probability of water", ylab = "posterior prob")
mtext(paste(GRID_LEN,"points"))
Quadratic approximation scales better than grid approximation
It works in two steps:
Use some sort of gradient ascent algorithm to find the peak (the MAP)
Estimate the curvature near the peak, which yields the quadratic approximation (reminiscent of a Taylor series expansion)
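The two steps can be sketched by hand with base R’s optim() - a minimal, hand-rolled version of what quap automates, using the globe-tossing numbers from the example below (w = 6 waters in n = 9 tosses, flat prior):

```r
# negative log posterior: with a flat prior this is just the negative log likelihood
nlp <- function(p) -dbinom(6, size = 9, prob = p, log = TRUE)

# step 1: climb to the peak (the MAP)
fit <- optim(par = 0.5, fn = nlp, method = "Brent",
             lower = 1e-5, upper = 1 - 1e-5, hessian = TRUE)
fit$par  # MAP, about 0.67

# step 2: the curvature (Hessian) at the peak gives the Gaussian approximation's sd
sqrt(1 / fit$hessian[1, 1])  # about 0.16
```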
library(rethinking)
## Loading required package: rstan
## Loading required package: StanHeaders
## rstan (Version 2.21.3, GitRev: 2e1f913d3ca3)
## For execution on a local, multicore CPU with excess RAM we recommend calling
## options(mc.cores = parallel::detectCores()).
## To avoid recompilation of unchanged Stan programs, we recommend calling
## rstan_options(auto_write = TRUE)
##
## Attaching package: 'rstan'
## The following object is masked from 'package:tidyr':
##
## extract
## Loading required package: parallel
## Loading required package: dagitty
## rethinking (Version 2.01)
##
## Attaching package: 'rethinking'
## The following object is masked from 'package:purrr':
##
## map
## The following object is masked from 'package:stats':
##
## rstudent
globe.qa <- quap(
alist(
w ~ dbinom(9,p),
p ~ dunif(0,1)
),
data = list(w=6)
)
# display the summary of the quadratic approximation
precis(globe.qa)
# wrangle
tibble(w = c(6, 12, 24),
n = c(9, 18, 36),
s = c(.16, .11, .08)) %>%
expand(nesting(w, n, s),
p_grid = seq(from = 0, to = 1, length.out = GRID_LEN)) %>%
mutate(prior = 1,
m = .67) %>%
mutate(likelihood = dbinom(w, size = n, prob = p_grid)) %>%
mutate(unstd_grid_posterior = likelihood * prior,
unstd_quad_posterior = dnorm(p_grid, m, s)) %>%
group_by(w) %>%
mutate(grid_posterior = unstd_grid_posterior / sum(unstd_grid_posterior),
quad_posterior = unstd_quad_posterior / sum(unstd_quad_posterior),
n = str_c("n = ", n)) %>%
mutate(n = factor(n, levels = c("n = 9", "n = 18", "n = 36"))) %>%
# plot
ggplot(aes(x = p_grid)) +
geom_line(aes(y = grid_posterior)) +
geom_line(aes(y = quad_posterior),
color = "grey50") +
labs(x = "proportion water",
y = "density") +
theme(panel.grid = element_blank()) +
facet_wrap(~ n, scales = "free")
Bayes theorem is often taught through the lens of a medical test:
Ex: What is the probability of having covid given a positive pcr test?
Let \(Pr(pos|covid) = 0.98\) (link)
Let \(Pr(pos|healthy) = 0.004\) (link)
Let \(Pr(covid) = 0.25\) (just assume 25% of people actively have covid in your area)
Using Bayes theorem: \[ Pr(covid|pos) = \frac{Pr(pos|covid)Pr(covid)}{Pr(pos)} \]
where \(Pr(pos) = Pr(pos|covid)Pr(covid) + Pr(pos|healthy)Pr(healthy)\). So: \(Pr(covid|pos) = \frac{0.98 \times 0.25}{0.98 \times 0.25 + 0.004 \times 0.75} = 98.8\%\)
It’s kind of unnatural to think like this, but let’s convert it to frequencies instead
In a population of 100,000 people, 25,000 have covid
Out of the 25,000 who have covid, 24,500 test positive
Of the 75,000 healthy people, 300 test positive
So, the calculation becomes much more straightforward
\[ Pr(covid|pos) = \frac{24,500}{24,500+300} = 98.8\% \]
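Both routes can be checked in a couple of lines of R (using the 25% prevalence implied by the frequency version):

```r
# probability route
p_pos_covid   <- 0.98
p_pos_healthy <- 0.004
p_covid       <- 0.25
p_pos <- p_pos_covid * p_covid + p_pos_healthy * (1 - p_covid)
p_pos_covid * p_covid / p_pos  # ~0.988

# frequency route: the same counts out of 100,000 people
24500 / (24500 + 300)          # ~0.988, same answer
```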
This frequency format allows us to exploit the use of sampling, in addition to the fact that sampling allows us to use more advanced methods down the line like Markov chain Monte Carlo, which is a sampling-based procedure that will be discussed later.
Really interesting “Rethinking” section in the book worth discussing here. THIS IS SOMETHING WORTH REMEMBERING FOR FUTURE STATISTICAL WORK.
Scientific inferences are often structured in the following way
A hypothesis is either true or false
We use a statistical procedure to get an imperfect clue of the hypothesis’ falsity
We use Bayes’ theorem to logically deduce the impact of this clue on the state of the hypothesis
We don’t often see step 3 get completed in real life, though… All one needs to do is think about how often research is proven wrong even though the study that generated the results reported a small \(\alpha\) value. Just because the alpha for a statistical model is low does not mean the process model or the scientific hypothesis is correct.
Below, we are going to manufacture a posterior distribution and then sample from it (the first plot). Then we look at the distribution of values in the second plot.
GRID_LEN = 1000
p_grid <- seq(from=0, to=1, length.out=GRID_LEN)
prior <- rep(1, GRID_LEN)
likelihood <- dbinom(6, size=9, prob=p_grid)
posterior <- likelihood * prior
posterior <- posterior / sum(posterior)
## Now let's sample this posterior distribution
SAMPLE_SIZE = 2e5
samples <- sample(p_grid, prob=posterior, size=SAMPLE_SIZE, replace=T)
plot(samples)
dens(samples)
The first plot is as if you were flying over the samples from the posterior distribution. The second plot is if you were looking at it from the side with all the samples stacked on top of each other.
How much posterior probability lies below \(p = 0.5\)? Well, that’s easy:
sum(posterior[p_grid<0.5])
## [1] 0.1718746
But, if we didn’t have the grid approximation posterior, and all we had were samples from the posterior, then it’s still pretty much just as easy:
sum(samples<0.5)/SAMPLE_SIZE
## [1] 0.17265
Clearly, the sample-based method is not exact, but it is really close.
sum(samples > 0.5 & samples < 0.75)/SAMPLE_SIZE
## [1] 0.604655
quantile(samples, 0.8)
## 80%
## 0.7597598
quantile(samples, c(0.1,0.9))
## 10% 90%
## 0.4484484 0.8108108
We have seen above that there are ways to describe the posterior distribution using density/mass over a specified region of possible parameter values, as well as by specifying the mass and then finding the region that contains it. Is there a way to summarize the distribution using a single point? Kind of, yes.
We can use a principled approach to selecting a point estimate if we employ a loss function. Say we wanted to pick a point estimate that best estimates the parameter such that you get $100 if you’re exactly right; if you’re wrong, an amount proportional to your error is subtracted from the winnings.
So, we want to minimize the expected loss \(|d - p|\), where \(d\) is the point estimate and \(p\) is the actual parameter value. What \(d\) do you pick? Well, let’s use our grid approximation to determine that:
loss <- sapply(p_grid, function(d) sum(posterior*abs(d-p_grid)))
The loss vector above shows us the expected loss for each candidate point estimate \(d\) in our grid. Which value minimizes it?
p_grid[which.min(loss)]
## [1] 0.6446446
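As a side note (my own check, not from the source): absolute loss is minimized by the posterior median, while quadratic loss is minimized by the posterior mean. Rebuilding the same grid posterior:

```r
# grid posterior for 6 waters in 9 tosses, flat prior
p_grid <- seq(from = 0, to = 1, length.out = 1000)
posterior <- dbinom(6, size = 9, prob = p_grid)
posterior <- posterior / sum(posterior)

abs_loss  <- sapply(p_grid, function(d) sum(posterior * abs(d - p_grid)))
quad_loss <- sapply(p_grid, function(d) sum(posterior * (d - p_grid)^2))

p_grid[which.min(abs_loss)]   # ~0.645, the posterior median
p_grid[which.min(quad_loss)]  # ~0.636, the posterior mean
```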
Likelihoods are bidirectional such that:
Given an observation, the likelihood tells us the most plausible parameter (or, more accurately, the plausibilities of all possible parameters)
Given a parameter, the likelihood gives us a distribution of possible observations to sample from (SIMULATION)
This is what it means when we say that Bayesian models are generative.
From the second bullet above:
dbinom(0:2, size=2, prob=0.7)
## [1] 0.09 0.42 0.49
So, this means that given $p=0.7$, there is a 9% chance of observing an observation of value 0, a 42% chance of observing an observation with value 1, and a 49% chance of observing an observation with value 2.
rbinom generates random samples from a binomial distribution:
rbinom(10, size=2, prob=0.7)
## [1] 0 1 1 1 2 2 2 2 2 2
SAMPLE_SIZE = 1e5
dummy_w <- rbinom(SAMPLE_SIZE, size=2, prob=0.7)
table(dummy_w)/SAMPLE_SIZE
## dummy_w
## 0 1 2
## 0.09163 0.41908 0.48929
SAMPLE_SIZE = 1e5
SIZE = 9
PROB = 0.7
dummy_w <- rbinom(SAMPLE_SIZE, size=SIZE, prob=PROB)
simplehist(dummy_w, xlab='dummy water count')
To propagate parameter uncertainty into the simulation, we don’t pass a single value to the prob argument; instead we pass it all the samples from the parameter’s posterior distribution. Notice how w_bad in the following code chunk is far more certain about the value of water globe tosses compared to w_good.
w_bad <- rbinom(SAMPLE_SIZE, size=9, prob=0.6)
w_good <- rbinom(SAMPLE_SIZE, size=9, prob=samples)
simplehist(w_bad)
simplehist(w_good)
Below are some of the hard homework questions from chapter 3
library(rethinking)
data(homeworkch3)
boys <- sum(birth1) + sum(birth2)
total_n_children <- length(birth1) + length(birth2)
## Generating the grid
GRID_LEN = 1000
p_grid <- seq(from=0, to=1, length.out=GRID_LEN)
prior <- rep(1, GRID_LEN) ## Uniform prior distribution
likelihood <- dbinom(boys, size=total_n_children, prob=p_grid)
posterior <- likelihood * prior
posterior <- posterior / sum(posterior)
#plotting
plot(posterior ~ p_grid , type="l" )
abline( v=0.5 , lty=2 )
# using the maximum a posteriori (MAP), AKA the mode
p_grid[which.max(posterior)]
## [1] 0.5545546
SAMPLE_SIZE = 1e5
samples <- sample(p_grid, prob=posterior, size=SAMPLE_SIZE, replace=T)
# plot(samples)
# dens(samples)
PI(samples, 0.5)
## 25% 75%
## 0.5305305 0.5785786
PI(samples, 0.89)
## 5% 94%
## 0.4974975 0.6106106
PI(samples, 0.97)
## 2% 98%
## 0.4784785 0.6296296
dummy_b <- rbinom(1e5, 200, prob=samples)
dens(dummy_b, xlab='dummy boy count')
abline( v=sum(birth1)+sum(birth2) , col="red" )
b1sim <- rbinom( 10000 , size=100 , prob=samples )
dens( b1sim, adj=1 )
abline( v=sum(birth1) , col="red" )
birth_following_girl <- birth2[birth1==0]
b01sim <- rbinom(1e5, size=length(birth_following_girl), prob=samples)
dens(b01sim,adj=.1)
abline( v=sum(birth_following_girl) , col="red" )
Say you walk a random distance between -1 foot and 1 foot, 16 times. How far do you end up?
Well, it ends up being approximately a Gaussian curve, because sums near zero - where the positive and negative steps cancel - have the largest number of ways (garden of forking data) of being realized, hence the density is greatest in those regions. Remember: Bayesian inference is just counting.
N_STEPS <- 16
pos <- replicate(100000, sum(runif(N_STEPS,-1,1)))
dens(pos)
n_days <- 16
max_rate <- 0.05
growth <- replicate(1e4, prod(1+runif(n_days,0,max_rate)))
dens(growth)
This is because multiplication of numbers close to 1 is essentially addition, as the equation below indicates (intuitively). Say bacteria grow at 10% for two days in a row: \[ 1.1 \times 1.1 = 1.21 \]
We could also approximate this product by just adding the increases (10% + 10%) and be off by only 0.01. \[ 1.1 \times 1.1 = (1+0.1)(1+0.1) = 1+0.2+0.01 \approx 1.2\]
The bigger the rate, the less normal the distribution appears
n_days <- 16
max_rate <- 0.5
growth <- replicate(1e4, prod(1+runif(n_days,0,max_rate)))
dens(growth, norm.comp = T)
But, when the rates get big, we can simply use a log transformation and the distribution becomes Gaussian again:
log.big <- replicate(1e4, log(prod(1 + runif(n_days,0,max_rate))))
dens(log.big, norm.comp = T)
library(rethinking)
data("Howell1")
d <- Howell1
d
d2 <- d[d$age>=18,]
curve(dnorm(x, 178, 20), from=100, to=250)
curve( dunif(x, 0, 50), from=-10, to=60)
SAMPLE_SIZE = 1e4
sample_mu <- rnorm(SAMPLE_SIZE, 178, 20)
sample_sigma <- runif(SAMPLE_SIZE, 0, 50)
prior_h <- rnorm(SAMPLE_SIZE, sample_mu, sample_sigma)
dens(prior_h)
# grid approximation of mu
mu.list <- seq(from=140, to=160, length.out=200)
# grid approximation of sigma
sigma.list <- seq(from=4, to=9, length.out=200)
# cartesian product of grids (THIS IS WHY GRID APPROXIMATION ISN'T SCALEABLE)
post <- expand.grid(mu=mu.list, sigma=sigma.list)
post$LL <- sapply(1:nrow(post), function(i)sum(dnorm(
d2$height,
mean=post$mu[i],
sd=post$sigma[i],
log=TRUE
)))
post$prod <- post$LL + dnorm(post$mu, 178, 20, TRUE) +
dunif(post$sigma, 0, 50, TRUE)
post$prob <- exp(post$prod - max(post$prod))
post
image_xyz(post$mu, post$sigma, post$prob)
Now, we want to sample from this distribution. We will do this by sampling random rows in proportion to their posterior probability:
sample.rows <- sample(1:nrow(post), size=SAMPLE_SIZE, replace=TRUE, prob=post$prob)
sample.mu <- post$mu[sample.rows]
sample.sigma <- post$sigma[sample.rows]
plot(sample.mu, sample.sigma, cex=.8, pch=1, col=col.alpha(rangi2,0.1))
dens(sample.mu, adj = 1)
dens(sample.sigma, adj=1)
HPDI(sample.mu)
## |0.89 0.89|
## 153.8693 155.1759
HPDI(sample.sigma)
## |0.89 0.89|
## 7.266332 8.195980
quap
quap works by using the model definition (the \(h_i, \mu, \sigma\) as defined above) to find the posterior probability at each combination of parameter values; then it climbs the posterior until it finds the peak, the MAP (maximum a posteriori).
library(rethinking)
data(Howell1)
d <- Howell1
d2 <- d [ d$age >= 18, ]
flist <- alist(
height ~ dnorm(mu, sigma),
mu ~ dnorm(178, 20),
sigma ~ dunif(0, 50)
)
m4.1 <- quap(flist, data=d2)
precis(m4.1)
m4.2 <- map(alist(
height ~ dnorm(mu, sigma),
mu ~ dnorm(178, 10),
sigma ~ dunif(0, 50)
), data=d2)
precis(m4.2)
m4.3 <- map(alist(
height ~ dnorm(mu, sigma),
mu ~ dnorm(178, .1),
sigma ~ dunif(0, 50)
), data=d2)
precis(m4.3)
quap fit
When R computes the quadratic approximation, it has to calculate covariances between all the parameters (as well as the variances of the parameters themselves).
vcov( m4.1 )
## mu sigma
## mu 0.1697690263 0.0002439593
## sigma 0.0002439593 0.0849424718
These can be converted to standard deviations that match our precis from above:
round(0.1697396837^(1/2),2)
## [1] 0.41
print('this matches 0.41 from the precis output')
## [1] "this matches 0.41 from the precis output"
round(0.0849059133^(1/2),2)
## [1] 0.29
print('this matches 0.29 from the precis output')
## [1] "this matches 0.29 from the precis output"
cov2cor( vcov(m4.1) )
## mu sigma
## mu 1.000000000 0.002031541
## sigma 0.002031541 1.000000000
The rethinking package provides a way to do this:
library(rethinking)
post <- extract.samples(m4.1, n=1e4)
precis(post)
head(d2)
plot(d2$height ~ d2$weight)
\[ h_i \sim N(\mu_i, \sigma) \\ \mu_i = \alpha + \beta x_i \\ \alpha \sim N(178, 100) \\ \beta \sim N(0, 10) \\ \sigma \sim U(0, 50) \]
This tells us that:
Height is normally distributed with a mean of \(\mu_i\), but \(\mu_i\) is not the same all the time
\(\mu_i\) is a function of weight, where:
\(\alpha\) is the intercept (the height of someone who weighs 0 pounds…) and has a weak prior of 178 cm with a standard deviation of 100
\(\beta\) is the strength of association between weight and height and has a weak prior of 0 (no relationship), with equal probability that the relationship is positive or negative (likely a poor choice of prior)
\(\sigma\) is the same as before
precis(m4.3)
cov2cor(vcov(m4.3))
## a b sigma
## a 1.000000000 -0.989883031 0.001167113
## b -0.989883031 1.000000000 -0.001128566
## sigma 0.001167113 -0.001128566 1.000000000
In this case, the \(\alpha\) makes little sense since it is interpreted as “the height of someone with 0 weight is 113.90 cm.” This is nonsense. If we transform the weight predictor such that it always shows deviations from the average height it becomes more helpful. Also, note that the standard deviation is 5. So about 2 standard deviations on either side (103.9 - 123.9) is the 95% density in the posterior.
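The interval arithmetic, spelled out (using the intercept estimate of 113.9 and the standard deviation of 5 quoted above):

```r
# roughly 95% of the posterior for the intercept lies within 2 sd of the estimate
113.9 + c(-2, 2) * 5  # 103.9 123.9
```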
So, to fix the intercept interpretation issue, let’s center the data
d2$weight.c <- d2$weight - mean(d2$weight)
head(d2, 3)
Now we refit the model using weight.c instead of simply using weight:
m4.4 <- quap(
alist(
height ~ dnorm(mu, sigma),
mu <- a + b*weight.c,
a ~ dnorm(178, 100),
b ~ dnorm(0, 10),
sigma ~ dunif(0, 50)
),
data = d2
)
precis(m4.4)
Now, \(\alpha\) represents the mean height when the weight is the mean weight.
So let’s plot this (uncentered) model with the data:
plot(height ~ weight, data=d2)
abline(a=coef(m4.3)["a"], b=coef(m4.3)['b'])
Now we can do it the ggplot way:
ggplot(data=d2, aes(x=weight, y=height))+
geom_point() +
geom_smooth(method="lm", formula=y~x)+
theme_minimal()
post <- extract.samples(m4.3)
post
N <- 300
dN <- d2 %>%
slice_sample(n=N)
mN <- rethinking::map(
alist(
height ~ dnorm(mu, sigma),
mu <- a + b*weight,
a ~ dnorm(178, 100),
b ~ dnorm(0, 10),
sigma ~ dunif(0, 50)
),
data=dN
)
samples <- 20
post <- extract.samples(mN, n=samples)
plot(dN$weight, dN$height,
xlim=range(d2$weight),ylim=range(d2$height),
col=rangi2, xlab='weight', ylab='height')
mtext(paste0("N = ",N))
for (i in 1:samples)
abline(a=post$a[i], b=post$b[i], col=col.alpha("black",0.3))
mu_at_50 <- post$a + post$b * 50
dens(mu_at_50, col=rangi2, lwd=2, xlab = "mu|weight=50", show.HPDI=T, adj = 1)
HPDI(mu_at_50, prob=0.89)
## |0.89 0.89|
## 158.3914 159.6285
mu <- link(m4.3)
str(mu)
## num [1:1000, 1:352] 157 158 158 157 157 ...
weight.seq <- seq(from=25, to=70, by=1)
mu <- link(m4.3, data=data.frame(weight=weight.seq))
str(mu)
## num [1:1000, 1:46] 137 136 136 137 137 ...
plot(height~weight, d2, type="n")
for (i in 1:1000)
points(weight.seq, mu[i,], pch=16, col=col.alpha(rangi2, 0.5))
mu.mean <- apply(mu,2,mean)
mu.HPDI <- apply(mu,2,HPDI,prob=0.89)
plot(height~weight, d2, col=col.alpha(rangi2, 0.5))
lines(weight.seq, mu.mean)
shade(mu.HPDI, weight.seq)
Let’s back up here a little bit and think about what this plot is showing. It shows us the data points (obviously), the posterior mean height for each weight, and the 89% HPDI of the mean for each weight. Think back to our mathstats view of this model (omitting priors for brevity): \[ \space height_i \sim Normal(\mu_i, \sigma) \\ \space \mu_i = \alpha + \beta x_i \]
Notice how there are two parts here: height is linearly related to weight (what we just looked at), but height also has its own variation around the mean (\(\sigma\)). This means that there is uncertainty in all posterior estimates as well as uncertainty in the Gaussian likelihood. Below, we are simulating 1000 heights for each weight between 25 and 70.
sim.height <- sim(m4.3, data=list(weight=weight.seq))
sim.height <- as_tibble(sim.height)
sim.height
height.PI <- apply(sim.height, 2, PI, prob=0.89)
head(as_tibble(t(height.PI)))
# Percentile intervals of simulated heights at several widths
height.PI.67 <- apply(sim.height, 2, PI, prob=0.67)
height.PI.89 <- apply(sim.height, 2, PI, prob=0.89)
height.PI.99 <- apply(sim.height, 2, PI, prob=0.99)
plot(height ~ weight, d2, col=col.alpha(rangi2, 0.5))
# plot the regression line
lines(weight.seq, mu.mean)
# shade the HPDI region for the line
shade(mu.HPDI, weight.seq, col = col.alpha(rangi2, 0.5))
# shade the 67% percentile interval of simulated heights
shade(height.PI.67, weight.seq)
# shade the 89% percentile interval of simulated heights
shade(height.PI.89, weight.seq)
# shade the 99% percentile interval of simulated heights
shade(height.PI.99, weight.seq)
This puts it all together. We are showing: 1. the data points, 2. the MAP (maximum a posteriori) regression line, 3. the uncertainty around the MAP regression line (blue shaded area), and 4. the uncertainty in simulated heights (as dictated by \(\sigma\)) at the 67%, 89%, and 99% percentile intervals.
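Like `link`, `sim` can be replicated by hand: for each posterior sample, draw a height from the Gaussian likelihood \(Normal(\alpha + \beta x, \sigma)\). A minimal sketch of the manual equivalent (again assuming `post <- extract.samples(m4.3)` and `weight.seq` as above):

```r
# Manual equivalent of sim(): one simulated height per posterior
# sample per weight, drawn from Normal(a + b*weight, sigma)
sim.manual <- sapply(weight.seq, function(w)
  rnorm(length(post$a), mean = post$a + post$b * w, sd = post$sigma))
str(sim.manual)
```

This makes the two sources of uncertainty explicit: `post$a + post$b * w` carries the posterior uncertainty in the mean, while `rnorm(..., sd = post$sigma)` adds the sampling variation of individual heights around that mean.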
Here is an example model: \[
height_i \sim Normal(\mu_i, \sigma) \\
\mu_i = \alpha +\beta_1x_i + \beta_2x_i^2 + \beta_3x_i^3
\] Let’s instantiate this model using quap:
# standardizing the weight
d$weight.s <- (d$weight - mean(d$weight))/sd(d$weight)
# engineering the polynomial features
d$weight.s2 <- d$weight.s^2
d$weight.s3 <- d$weight.s^3
# fitting the model
m4.6 <- quap(
alist(
height ~ dnorm(mu, sigma),
mu <- a + b1*weight.s + b2*weight.s2 + b3*weight.s3,
a ~ dnorm(178, 100),
b1 ~ dnorm(0, 10),
b2 ~ dnorm(0, 10),
b3 ~ dnorm(0, 10),
sigma ~ dunif(0, 50)
),
data=d
)
# plotting
plot(height ~ weight.s, d, col=col.alpha(rangi2,0.5), xaxt='n')
at <- c(-2, -1, 0, 1 , 2)
labels <- at*sd(d$weight) + mean(d$weight)
axis(side = 1, at = at, labels = round(labels, 1))
weight.seq <- seq(from=-2.2, to=2, length.out=30)
pred_dat <- list(weight.s = weight.seq, weight.s2=weight.seq^2, weight.s3=weight.seq^3)
mu <- link(m4.6, data=pred_dat)
mu.mean <- apply(mu, 2, mean)
mu.PI <- apply(mu, 2, PI, prob=0.89)
sim.height <- sim(m4.6, data=pred_dat)
height.PI <- apply(sim.height, 2, PI, prob=0.89)
lines(weight.seq, mu.mean)
shade( mu.PI, weight.seq )
shade( height.PI, weight.seq )
We will come back to polynomial regressions later when we look at overfitting and cross validation.
The example that will be used in this chapter is the relationship between marriage rates, divorce rates, and age at marriage.
# loading data
library(rethinking)
data("WaffleDivorce")
d <- WaffleDivorce
# standard scaling the predictor
d$MedianAgeMarriage.s <- (d$MedianAgeMarriage - mean(d$MedianAgeMarriage))/sd(d$MedianAgeMarriage)
# fitting the model
m5.1 <- quap(
alist(
Divorce ~ dnorm(mu, sigma),
mu <- a + bA*MedianAgeMarriage.s,
a ~ dnorm(10,10),
bA ~ dnorm(0,1),
sigma ~ dunif(0, 10)
), data=d
)
precis(m5.1)
# Compute percentile interval of mean
MAM.seq <- seq(from=-3, to=3.5, length.out=30)
mu <- link(m5.1, data=data.frame(MedianAgeMarriage.s=MAM.seq))
mu.PI <- apply(mu, 2, PI)
# Plotting
plot(Divorce ~ MedianAgeMarriage.s, data=d, col=rangi2)
abline(m5.1)
shade(mu.PI, MAM.seq)
d$Marriage.s <- (d$Marriage-mean(d$Marriage))/sd(d$Marriage)
m5.2 <- rethinking::map(
alist(
Divorce ~ dnorm(mu, sigma),
mu <- a + bR*Marriage.s,
a ~ dnorm(10,10),
bR ~ dnorm(0,1),
sigma ~ dunif(0, 10)
),data=d
)
precis(m5.2)
# Compute percentile interval of mean
MR.seq <- seq(from=-3, to=3.5, length.out=30)
mu <- link(m5.2, data=data.frame(Marriage.s=MR.seq))
mu.PI <- apply(mu, 2, PI)
# Plotting
plot(Divorce ~ Marriage.s, data=d, col=rangi2)
abline(m5.2)
shade(mu.PI, MR.seq)
Much like in the previous sections, we can formalize the notation (mathstats) of a multivariate linear model: \[ Divorce_i \sim Normal(\mu_i, \sigma)\\ \mu_i = \alpha + \beta_RR_i + \beta_AA_i\\ \alpha \sim Normal(10,10)\\ \beta_R \sim Normal(0,1)\\ \beta_A \sim Normal(0,1)\\ \sigma \sim Uniform(0,10) \]
m5.3 <- quap(
alist(
Divorce ~ dnorm(mu, sigma),
mu <- a + bR*Marriage.s + bA*MedianAgeMarriage.s,
a ~ dnorm(10,10),
bR ~ dnorm(0,1),
bA ~ dnorm(0,1),
sigma ~ dunif(0,10)
), data=d
)
p <- precis(m5.3)
p
What this precis output and precis plot tell us is that once we know the median age at marriage for a given state, there is little, if any, additional predictive power in also knowing that state’s marriage rate. This does not mean that marriage rate is useless; it means that after conditioning on median age at marriage it is almost useless. If you don’t know the median age at marriage, the marriage rate is still useful.
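One way to see this graphically is a predictor residual plot: regress marriage rate on median age at marriage, and then plot divorce rate against what is left over. A sketch under the same priors as the models above (the model name `m5.4` is assumed here, not defined earlier in these notes):

```r
# Marriage rate as a function of median age at marriage
m5.4 <- quap(
  alist(
    Marriage.s ~ dnorm(mu, sigma),
    mu <- a + b*MedianAgeMarriage.s,
    a ~ dnorm(0, 10),
    b ~ dnorm(0, 1),
    sigma ~ dunif(0, 10)
  ), data=d
)
# Residual marriage rate: what marriage rate knows that
# median age at marriage does not
mu <- coef(m5.4)["a"] + coef(m5.4)["b"]*d$MedianAgeMarriage.s
m.resid <- d$Marriage.s - mu
plot(d$Divorce ~ m.resid, col=rangi2,
     xlab="Marriage rate residuals", ylab="Divorce")
abline(v=0, lty=2)
```

If the residuals carry little information about divorce, the scatter will show no clear trend, which is the visual counterpart of the small `bR` coefficient in `m5.3`.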